Characterization of microsatellite markers in the coding regions of the Penaeus vannamei genome

In this study, an extensive analysis of microsatellite markers (Single Tandem Repeats—STRs) in Penaeus vannamei was conducted at an advanced level. The markers were thoroughly examined, characterized, and specific markers located within coding regions were identified. Out of a total of 306 STRs, 117 were classified as perfect markers based on their single repeat motif. Among these perfect markers, 62 were found to be associated with predicted coding genes (mRNA), which were involved in various functions such as binding, catalytic activity, ATP-dependent activity, transcription, structural and molecular regulation. To validate the accuracy of the findings, a sample of nine markers was subjected to in vitro testing, which confirmed the presence of polymorphisms within the population. These results suggest the existence of different protein isoforms within the population, indicating the potential of these markers for application in both population and phenotype-genotype association studies. This innovative approach opens up new possibilities for investigating the impact of genomic plasticity in populations of P. vannamei.


Introduction
An inherent characteristic of marine shrimp production systems is the increase in inbreeding within populations, which leads to a higher genetic similarity among individuals.Consequently, this can result in undesired production traits, such as increased susceptibility to pathogens [1][2][3].Genetic selection plays a crucial role in identifying desirable phenotypes and maintaining genetic diversity, both of which are essential for successful shrimp farming [4,5].

Microsatellite search on Penaeus vannamei
Literature on STRs in P. vannamei was obtained from the NCBI (https://www.ncbi.nlm.nih.gov/) using the following ENTREZ query: "microsatellite repeats" [

MeSH Terms] OR ("microsatellite"[All Fields] AND "repeats"[All Fields]) OR "microsatellite repeats"[All Fields] OR "microsatellites"[All Fields]) AND shrimp[All Fields] AND ("penaeidae"[MeSH Terms] OR "penaeidae"[All Fields] OR "penaeus"[All Fields]) AND vannamei[All Fields].
Based on this investigation, a comprehensive database of STRs was established.These markers were categorized based on their type, whether they were perfect or composite, as well as their specific motif and number of repeats.The database also included information about the position of each marker within the genomic sequence, the primer sequences used for amplification (both forward and reverse), the melting temperature (TM) of each primer, the played no part in study design, data collection and analysis, the decision to publish, or the preparation of the manuscript.number of alleles detected for each STR in the tested populations, and the name of the corresponding author.
For subsequent analyses, only the markers classified as perfect STRs were chosen and utilized.

Identification of microsatellites within coding sequences
The in silico identification of the sequences flanking each marker (i.e.: the amplicon sequence) was performed using Primer-BLAST (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). The forward and reverse primers for each STR were used as the search input, the 'database' was set to "Refseq representative genomes" and the "organism" parameter was set to "Penaeus vannamei (taxid:6689)".The parameter "max target amplicon size" was adjusted to 500 bp based on the average size of the alleles previously described.
Sequences in which the primers showed a 100% identity match or had at most 1 base pair difference (SNP or deletion) were considered for further analysis.The flanking regions, including the primers, also known as putative amplicons, were manually extracted.This process involved using the provided coordinates of the aligned primers and ensuring that the entire amplicon region was captured, starting from the left-most coordinate, and ending at the rightmost coordinate in the corresponding sequence.In cases where confirmation of microsatellites was necessary, the Geneious software version 7.1.3[22] was employed.
Subsequently, the sequences of the putative amplicons were functionally annotated using BLASTx [23] against the 'nr' database, employing default parameters.Protein functions were assigned to the sequences based on their best alignment (best hit).This analysis aimed to identify putative genes encoded by the sequences aligning with each marker.By associating the marker sequences with known protein functions, we gained insights into the potential genetic elements represented by the markers.
Next, we determine genomic locations of these marker sequences in the P. vannamei genome, unveiling the associated genomic features based on the provided genome annotations [21].
For this analysis, we exclusively selected markers that yielded a single hit during the Primer-BLAST step to ensure accuracy and eliminate potential ambiguities caused by multiple targets within the same or different genomic scaffolds.In essence, we conducted a BLASTn search to obtain the genomic location of each marker, employing a 'best hit' approach.Subsequently, we examined the genome annotation file (GFF file) directly using the BEDTools software v2.26.0 to identify any overlapping annotations or features associated with the markers [24].First, BLASTn from the NCBI BLAST+ package v2.9.0 [25] was run locally (parameters: -task blastn -outfmt 6) to align the marker sequences to the P. vannamei genome, assigning each marker's location to the one in its best hit.
A second BLASTn analysis was conducted using the Lvan99 sequence, which exhibited 90% low complexity and initially failed to align.To overcome this issue, the second BLASTn search was performed with the same parameters as before but with the DUST filter deactivated (-dust no), resulting in successful alignments.The results from both BLASTn analyses were then merged.
For each aligned marker sequence, the corresponding subject ID (representing the scaffold ID), start and end positions of the alignment in the subject, and query ID (representing the marker ID) were extracted.These details were formatted into a BED format consisting of four columns.The BEDtools program was utilized to search for features based on this BED file, allowing for further analysis of the aligned marker sequences in relation to genomic features [24].
The association of markers to linkage groups was performed using the information provided in the P. vannamei genome [21].With the obtained BLAST best hits, we utilized the accession IDs of the subjects to retrieve the corresponding scaffold IDs.Subsequently, we mapped these scaffold IDs to their respective linkage groups, enabling the assignment of markers to specific linkage groups within the genome.
The BEDtools software was employed using the 'intersect' function, following the pseudocommand with the specified parameters: $ bedtools intersect -a <Intervals>.bed-b <GenomicAnnotations.gff> -wo.This allowed the identification of all genomic annotations within the P. vannamei genome that overlapped with the genomic regions where the markers aligned.Detailed information regarding these genomic annotations was extracted from the genomic feature file (GFF file) associated with the P. vannamei genome.The results obtained from the 'BEDtools intersect' analysis were subsequently analyzed using Python scripts, utilizing data analysis packages such as Pandas and Jupyter.
The protein identified through functional annotation using BLASTn from BEDtools was further subjected to a search in InterPro (https://www.ebi.ac.uk/interpro/).This allowed for the retrieval of Gene Ontology (GO) terms associated with the protein, providing insights into its molecular functions.

Selection of polymorphic STR markers, PCR and genotyping
After establishing the initial database, a series of filtering steps was applied to select the perfect markers for in vitro validation in the population of P. vannamei.The selected markers were those with the highest number of repeat motifs, the highest number of alleles, and located within coding regions.
DNA extraction was carried out using total DNA from 50 individuals of P. vannamei collected from shrimp farms in the state of Rio Grande do Norte, Brazil.Approximately 30 mg of abdominal muscle from each shrimp was utilized for DNA extraction, following the instructions provided by the Nucleospin1 Tissue Macherey-Nagel kit.
The selected microsatellite markers were amplified under the specific conditions previously described for each marker.Genotyping was performed using a 6.5% polyacrylamide gel stained with silver nitrate.The electrophoresis runs were conducted in a vertical tank (Owl Separation System, Thermo Fisher) for an approximate duration of 5 hours, with a voltage of 2000 V and a power of 400 W.
Allelic and genotypic frequencies and the number of expected and observed heterozygotes were calculated using the software GenAlEx v 6.503 [26,27].The Genepop v 4.7 software was used to calculate inbreeding coefficient values (FIS) for all loci in the population [28], available online at the website <https://genepop.curtin.edu.au/index.html>.Sequential Bonferroni corrections were performed to test the significance of the results obtained [29].

Perfect microsatellite markers in P. vannamei
We have identified 57 articles that describe microsatellites in P. vannamei from 1996 to 2021 (Table 1).Out of the microsatellites described, 306 have been characterized in population studies and exhibit polymorphism (S1 Table ), enabling a more comprehensive investigation in this study.Among these 306 microsatellite markers, 117 (38.88%) are considered perfect markers due to their single repeat motif (S2 Table ).
Within the perfect markers, the highest prevalence was observed for dinucleotide (46.2%), followed by trinucleotides (30.8%) and tetranucleotides (11.1%).The mononucleotide repeats consisted of thymine repeat motifs (T), ranging from 15 to 20 repeats.One microsatellite marker exhibited a hexanucleotide repeat motif (Fig 1A).The number of uninterrupted repeats ranged from 3 to 62, with the majority consisting of short repeats.Both groups with up to 5 repetitions and the category of 6 to 10 repetitions were more frequent, comprising 34 and 43 markers, respectively (Fig 1B).

Functional characterization of sequences containing microsatellites
Among the 135 sequences, 72 (53.33%) were associated with the prediction of gene features in functional annotation through BLASTn.Out of these, 62 were linked to predicted coding genes (mRNA), and 53 STRs were in exons.Furthermore, the linkage group with the highest number of STR markers was LG36 (n = 9), followed by LG20 (n = 6) and LG44 (n = 4) (Table 2 and S4 Table).These sequences were then assigned Gene Ontology (GO) annotations.
The GO terms for molecular function describe the biochemical activities of genes based on various biological features associated with a given protein, including "biological process," "cellular component," and "molecular function" [75].However, since there was insufficient information available for these proteins regarding the GO categories of "biological process" and "cellular component," they were not considered in our GO analysis.

Prospecting of STR markers and population analysis by genotyping
The nine markers located in exons described in Table 3 were evaluated in a population to verify their potential for population studies and potential variations in each coding region.
The average number of alleles per locus was approximately five, ranging from 1 to 8.However, the marker Livan44 did not exhibit any polymorphisms.Among the markers, Lvan93 displayed the highest heterozygosity (Ho = 1.000), while Livan04 had the lowest (Ho = 0).The markers Livan44, Lvan183, and Lvan93 showed significant departures from Hardy-Weinberg (HW) equilibrium based on the probability test.The markers CNM-MG-402 and CNM-MG-421 did not amplify in our population (Table 4).Although the primer Lvan204 could be amplified, population analysis was not conducted for this marker.
The inbreeding coefficient (FIS) was less than zero for Livan04, Lvan183, and Lvan93, indicating higher genetic variability and the coexistence of different protein isoforms within the population (Table 4).
The positions where these markers overlap in the exon are depicted in Fig 3 .As they are situated within exon regions, variations in the number of their repetitions can alter the codons, thus potentially modifying the function of the protein.The majority of these STRs were located at the 3' end of the mRNAs.

Discussion
This study aimed to gather and consolidate all the currently available information on microsatellite markers characterized for P. vannamei.We believe that this represents a significant milestone, as data on these markers were previously scattered and, in many cases, lacked information regarding their genomic location.This was primarily since most of these markers were designed prior to the first genomic draft of P. vannamei [21].The publication of the genomic draft showed the occurrence of 44 linkage groups, mapping to 44 distinct chromosomes.This allowed us to precisely map most of the markers to a linkage group, providing valuable information for breeding programs.Overall, our observations revealed a high occurrence of diverse repeat motifs (ranging in length from 1 to 6 bp) within perfect short tandem repeats (STRs), as depicted in Fig 1 .Among these, markers featuring dinucleotide repeats were the most prevalent, followed by trinucleotide and tetranucleotide repeats, which aligns with the findings reported in the original studies [21,69,72].
The main challenge encountered in this study was the task of establishing the linkage between these markers and their respective genomic locations.Up until now, no prior research has been conducted to connect the existing microsatellite markers (with available sequences) to their specific genomic positions.It was quite surprising to discover that numerous microsatellites in P. vannamei are situated within genomic regions that potentially bear some adaptive significance, as demonstrated in Table 2 and Fig    that out of the perfect markers, 62 (45.9%) were located within coding regions.In other words, these genes are associated with various processes such as host cell response to pathogens, energy metabolism, cytoskeleton organization/reorganization, protein folding, and the efficiency of translation mechanisms.
During our analysis, we observed redundancy and ambiguity in certain markers.Utilizing the Primer BLAST search, we found that the primers for the following five markers aligned with multiple locations in the genome.These locations could either be within the same genomic scaffolds or in different scaffolds, with the possibility of being on the same chromosome or not.The markers and their respective counts are as follows: CNM-MG-369 (n = 2), CNM-MG-444 (n = 3), CNM-MG-474 (n = 2), c7152/f1p0/6395 (n = 3), and c16138/f14p37/ 618 (n = 19).Further details can be found in S3 Table.This result confirms the existence of regions that are likely to arise from duplication events, a common feature in the P. vannamei genome also observed in other shrimp genomes [21].Consequently, the use of these markers requires caution since they might be inaccurate when used in the traditional way.
Five short tandem repeats (STRs) were successfully validated in a shrimp population, confirming the occurrence of protein variants within the population, as outlined in Table 4.Some markers within this panel have demonstrated potential for application in traditional population studies.However, we believe that the true value of this panel lies in the correlation between protein variations and phenotypic traits.The observed deviations from the Hardy-Weinberg equilibrium in certain markers could potentially be explained by specific selective pressures acting on these protein isoforms.The expansion or contraction of STRs can influence various biological processes, including gene transcription, splicing, and translation, thereby contributing to genomic plasticity [17].The following are examples of proteins that exhibit different isoforms within the analyzed shrimp population, along with their respective functions: 1. Clp proteases are ATP-dependent enzymes that play a crucial role in the cleavage of proteins and polypeptides by hydrolyzing peptide bonds.They serve as precise regulatory mechanisms involved in protein degradation and quality control processes within cells [76,77].
2. Sorbitol dehydrogenase (SDH) is an enzyme that relies on zinc as a cofactor and is primarily responsible for the conversion of D-sorbitol into D-fructose.This enzymatic activity holds significant importance in the early embryonic development of invertebrates [78].In a study conducted by [79], rainbow trout were used to investigate the effects of increasing stock density on sorbitol dehydrogenase (SDH) activity.The findings revealed that higher stock densities in tanks had a significant inhibitory effect on SDH activity.Consequently, this led to the accumulation of sorbitol, disrupting the physiological balance of the fish, and resulting in undesirable outcomes.In P. vannamei, the expression levels of SDH have been found to be associated with Taura Syndrome Virus (TSV) infection.
3. The 26S proteasome regulatory is a large multi-protein complex that selectively degrades ubiquitin-tagged proteins, and this process is ATP-dependent [80][81][82].It plays a key role in biological homeostasis since removing damaged or misfolded proteins [83,84].Furthermore, the ubiquitin-proteasome system may play an important role in crustaceans' immune response and can be a candidate gene involved in host viral disease defense [85,86].[87] found a Ubiquitin proteasome mediated response in animals infected with WSSV to reduce the deleterious damage caused by the viral infection.
4. Protein tyrosine phosphatases (PTPs) are signaling molecules responsible for dephosphorylating tyrosine residues in proteins and play critical roles in cellular regulation and cell adhesion [88,89].In a study conducted by [89] using P. vannamei, it was discovered that the expression of non-receptor protein tyrosine phosphatase 6 (PTP6) was regulated by the interferon regulatory factor (IRF).Additionally, PTP6 was found to promote STAT dimerization, leading to increased expression of genes associated with STAT-targeted immune effectors.This, in turn, enhanced the antiviral immunity of shrimp.
5. Inverted Formin plays a diverse range of roles in actin polarization and depolarization processes.It is involved in various cellular processes, including cytokinesis, cell adhesion, cell motility, endocytosis, among others.Additionally, studies have reported its interaction with microtubules, which contributes to microtubule stabilization [90].

Conclusion
This study provides an updated characterization of previously validated microsatellite markers in population studies for P. vannamei.Among the 306 markers analyzed, 117 were identified as perfect repeats/markers, with 62 located within coding regions of the genome.Functional characterization revealed their potential influence on cellular processes, including binding function, catalytic activity, and transcriptional activity, which may be linked to phenotypic traits.Additionally, a panel of five markers located in coding regions was validated in vitro, confirming the presence of different protein isoforms within the P. vannamei population.The results highlight the potential of coding region markers in establishing direct associations between phenotype and genotype.The validation of six markers in vivo further supports this notion.Moreover, the study suggests the possibility of exploring the remaining untested markers through in silico functional predictions, which would expand the potential for phenotypegenotype associations.

2 .
Functional annotation analysis indicated

Fig 3 .
Fig 3. STRs located in exons within the P. vannamei genome.The marker sequences (shown in blue) represent the entire amplicon and include the primers (dark green and light green).The repeats within the markers are also indicated (orange).Exon annotations are presented in black.https://doi.org/10.1371/journal.pone.0289351.g003